ABSTRACT 

The "Writing What You Read" (WWYR) rubric was 
designed for large-scale assessments, and differs from most narrative 
rubrics in its narrative-specific content and its developmental 
framework. The rubric contains five analytic subscales for theme, 
character, setting, plot, and communication, and a sixth holistic 
scale for overall effectiveness. Evidence of validity was gathered 
for the WWYR scoring rubric through comparison with an established 
narrative rubric that has been demonstrated to be sound. The 
comparison rubric, derived from comparative studies of student 
writing competence, is a holistic/analytic scheme used annually in 
California. Five raters reviewed narrative samples collected from an 
elementary school participating in the Apple Classrooms of Tomorrow 
project. Both rubrics were generally used consistently by raters. 
Results suggest that at least three subscales of WWYR can be used 
reliably and meaningfully in large-scale assessment as long as each 
narrative is rated by two raters. Evidence is lacking for the 
technical soundness of the other scales, and findings further suggest 
that subscale judgments may not provide a technically sound profile 
of students' strengths and weaknesses. One figure and 12 tables 
present study findings. (Contains 23 references.) (SLD) 
TOWARD THE INSTRUCTIONAL UTILITY OF 

LARGE-SCALE WRITING ASSESSMENT: 
VALIDATION OF A NEW NARRATIVE RUBRIC 

Maryl Gearhart,1 Joan L. Herman,1 John R. Novak,1 
Shelby A. Wolf,2 and Jamal Abedi1 

In the press to design performance-based writing assessments to serve 
both policy and practice, scoring rubrics have undergone considerable scrutiny 
and revision (Freedman, 1991; Huot, 1991; Paul, 1993; Wiggins, 1993). While 
concerns for large-scale assessment and policy uses have emphasized 
requirements for technical quality (particularly the capacity to support 
interrater agreement), interest in instructional value and impact on practice 
has highlighted the importance of rubric content and structure. 

Two related issues have emerged in the content dialogue. First, existing 
rubrics cannot adequately represent the important qualities of good writing 
when scales or scale-point criteria are vague, confusing, or inconsistent with 
what is known about well-constructed and effective text (Baxter, Glaser, & 
Raghavan, in press; Paul, 1993; Resnick, Resnick, & DeStefano, 1993; Wiggins, 
1993; Wolf, 1993). 

Most of the scoring rubrics that I have encountered seem invalid to me. [W]e score 
what is easy, uncontroversial, and typical — not necessarily what is apt for 
identifying exemplary writing or apt for the demands of real-world writing. 
Consider [one state's] . . . descriptor for the top score on the scale [of] 
Organization/Content . . . Little in this scoring system places a premium on style, 
imagination, or ability to keep the reader interested. Only the top score description 
mentions "effective and vivid" responses, instead of those criteria being woven 
through the whole rubric. Yet we see this limitation in almost every writing 
assessment. (Wiggins, 1993, p. 21) 

Second, rubrics that do not reflect the qualities of good writing are limited 
in their instructional utility (Paul, 1993; Wiggins, 1993). If a central purpose of 
assessment is to guide instructional planning, then rubrics for assessing 
student writing must be derived from current English/language arts 
frameworks and must reflect those analyses of the contents, purposes, and 
complexities of text. Rubrics must communicate to teachers, students, and 
others what's important in writing performance. 

1 CRESST/University of California, Los Angeles. 2 CRESST/University of Colorado at 
Boulder. 

Certainly the challenges to rubric design are substantial. The purpose of 
the study reported here was to validate a new rubric designed to optimize 
content quality and to enhance instructional value, but whose technical quality 
is unknown. The design of the Writing What You Read (WWYR) narrative 
rubric (Wolf & Gearhart, 1993a, 1993b) was prompted by the need for 
judgments that "chart . . . the course between uniformity of judgment on the 
one hand and representation of complexity and diversity on the other hand" 
(Wolf, Bixby, Glenn, & Gardner, 1991). That need is particularly crucial for 
classroom teachers who are concerned not only with students' present work, 
but with their future growth. Existing narrative rubrics did not, in our 
judgment, have the potential to guide instruction. 

[F]or example, ... in a pilot project to score locally completed work . . . using the [a 
national] rubric, . . . [h]ere is a descriptor for a story that merits a score of 6 (the top 
level): "Paper describes a sequence of episodes in which almost all story elements 
are well developed (i.e., setting, episodes, characters' goals, or problems to be 
solved). The resolution of the goals or problems at the end are [sic] elaborated. The 
events are represented and elaborated in a cohesive way." Surely this is not the best 
description possible of a good story. (Wiggins, 1993, p. 21) 

Surely not. But could the "best description," or even a better description, be 
captured in a technically sound rubric? Our recognition of the "test-maker's 
dilemma" (Wiggins, 1993), that rubric complexity and face validity could 
result in a loss of technical capacity for large-scale use, was the impetus for 
the study reported here. 

Issues in the Design of Writing Rubrics 

A first decision involved type of rubric. Three types of scoring rubrics 
have been used in large-scale writing assessment. First, and probably most 
common, is holistic scoring — assignment of a single score reflecting a 
student's competence with all aspects of writing. A second approach is 
primary trait scoring — the construction of a rubric customized to specific 
prompts. A third method is the analytic rubric, in which defined dimensions 
of good writing (e.g., Organization, Content, Style, Voice, and Mechanics) are 
applied across a range of topics within broadly defined genres. 

CRESST Final Deliverable 

Figure 1. Narrative rubric. (Wolf/Gearhart/Quellmalz/Whittaker, 1993) 

Program Three, Project 3.1 

Advocates of these scoring approaches debate their efficiency, cost 
effectiveness, and relative value for instructional feedback. Although 
empirical comparisons frequently show significant correlations among 
analytic subscale scores and between holistic and analytic scoring, it is our 
view that concerns about instructional utility press for feedback beyond a 
single score. The scores produced by holistic scoring do not communicate the 
far more complex standards articulated by raters in moderation sessions, and 
therefore holistic scores are of limited value to the recipients. In contrast, 
analytic scales have greater instructional potential, in that they communicate 
a differentiated analysis of quality standards. They can do so, however, only if 
the dimensions reflect consensus on the components of valued performances. 
The design of the Writing What You Read rubric was motivated by the concern 
to ground rubric design in current analyses of effective narrative writing. 

The Writing What You Read Rubric 

The Writing What You Read rubric (Figure 1) differs from most narrative 
rubrics in its narrative-specific content and its developmental framework 
(Wolf & Gearhart, 1993a, 1993b). Designed for classroom use, the rubric 
contains five analytic subscales for Theme, Character, Setting, Plot, and 
Communication (Figure 1), and a sixth, holistic scale for Overall Effectiveness 
constructed specifically for this technical study (Table 1). Each subscale 
contains six levels designed to match current understandings of children's 
narrative development. The rubric was the product of collaboration with 
elementary teachers, and its use has been shown to impact teachers' 
understandings of narrative (Gearhart, Wolf, Burkey, & Whittaker, 1994). To 
date, it has not been used for large-scale assessment. 

The technical language of narrative is integral to the WWYR tool, unlike 
the descriptors of many narrative rubrics that are not unique to this genre. 
Words like topic (rather than theme), event (rather than episode), and diction 
(rather than style) create a sense of "genre generality" (Gearhart et al., 1994). 
When narrative components are included, they are usually limited to 
character, setting, and plot, omitting theme — the heart of narrative, a 
comment about life which illuminates the emotional content of the human 
condition. A subscale for organization may not capture the orchestration of 
components. Definitions for the narrative's development may omit the 
communicative aspects of style and tone, focusing instead on logical 
transitions; although transitions are important and logic is always welcome, 
the communicative aspects of narrative are more centered on creating 
images — using language purposefully, metaphorically, and rhythmically to 
take the reader off the page and into another world. 

WWYR was designed as an alternative to narrative rubrics that are not 
grounded in genre, either in its traditional sense of a classification system for 
organizing literature (a system much subject to change) or in its more current 
sense of social action constrained by particular rhetorical forms. The 
development of character, the symbolism in setting, the complexity of plot, the 
subtlety of theme, the selected point of view, and the elaborate use of language 
all depend on and are defined by genre. If we are going to teach children about 
narrative and how to grow as young story writers, then surely we would want 
to use more precise language and to provide a fuller picture of what narrative 
is. If we limit or simplify concepts for children (and for their teachers), we 
refuse them access to more intriguing and more authentic possibilities. The 
WWYR rubric is a simplification, of course — how else could something as 
complex as narrative fit on a single page? Yet, its language and focus provide 
a key to a much larger door, opening onto the evocative, emotional, and 
eminently human symbol system of narrative meaning. 

Our Study 

The purpose of our study was to gather evidence of validity for the Writing 
What You Read rubric, through technical comparisons with an established 
narrative rubric that has consistently demonstrated sound technical 
capabilities in large-scale assessments of elementary level writing. Our 
studies addressed a series of questions regarding the technical quality of the 
rubric. 

Reliability: Can the Writing What You Read rubric be applied to scoring of 
classroom narratives with the same levels of rater agreement as an established 
narrative rubric? 

We selected a comparison narrative rubric that has consistently demonstrated 
excellent levels of rater agreement. Can raters make judgments with the 
Writing What You Read rubric at similar levels of reliability? To investigate 
this question, judges rated classroom narratives with both rubrics, and we 
compared reliabilities. 

Validity of the Writing What You Read rubric: What is the evidence that scores 
derived from the WWYR rubric are meaningful indices of students' narrative 
writing? 

We inferred validity from grade-level differences (scores should increase with 
age), from relationships of scores across types of assessments (e.g., scores 
derived from both rubrics should be correlated), from interscale correlations 
(for both rubrics, the subscale ratings should not be highly intercorrelated), 
from consistency of raters' judgments across rubrics, and from raters' 
confidence in their judgments based on opinions expressed in post-rating 
interviews. 

Procedures 

Site 

The narrative samples were collected from an elementary school that 
served as a longitudinal research site for the national Apple Classrooms of 
Tomorrow (ACOT™) project. The school is located in a middle-class suburb of 
Silicon Valley. 

Datasets 

Narratives were sampled from classroom writing in Grades 1 through 6. 
Students' names and grade levels were removed and replaced with 
identification numbers. Narratives were sorted by level (primary = Grades 1 
and 2, middle = Grades 3 and 4, and upper = Grades 5 and 6), and then 
scrambled within sets. 

Comparison Rubric 

The comparison rubric, derived from analytic scales used in the IEA 
comparative studies of student writing competence, is a holistic/analytic 
scheme (Table 2). (See Quellmalz & Burry, 1983, for description of original CSE 
scales.) In annual use in assessments of students' narratives in a California 
school district, this rubric has also been used extensively in our Center for 
evaluations of elementary students' writing (Baker, Gearhart, & Herman, 
1990, 1991; Baker, Herman, & Gearhart, 1988; Gearhart, Bank, & Herman, 
1990; Gearhart, Herman, Baker, & Whittaker, 1992; Gearhart, Herman, & 
Bank, 1989; Herman, Gearhart, & Baker, 1994). Consistently demonstrating 
excellent levels of rater agreement and meaningful relationships with indices 
of instructional emphasis, the rubric represents a sound technical approach to 
writing assessment. Four 6-point scales are used for assessment of General 
Competence, Focus/Organization, Elaboration, and Mechanics; in the current 
study, we were concerned just with narrative content, and the raters did not 
apply the Mechanics scale. 

Rating Procedures 

Raters. Our five raters were drawn from three communities. Two raters 
were elementary teachers with experience using the comparison rubric for 
scoring students' narrative writing; one of these raters had considerably more 
experience than the other with district scoring sessions. Two raters were 
elementary teachers experienced with other large-scale efforts; one scored 
elementary narrative and persuasive writing samples in English and Spanish 
for two years as part of a program evaluation, and the other scored writing 
samples of elementary school students in English and Spanish as part of a 
nationally implemented supplemental education program. The fifth rater was 
a research assistant with experience scoring elementary narrative and 
persuasive writing samples in English and Spanish for program evaluation. 

Rating procedures. In conducting the narrative scoring, raters were 
informed that the samples would represent primary (Grades 1-2), middle 
(Grades 3-4), or upper (Grades 5-6) elementary levels, and that sets would be 
labeled by levels. Raters completed comparison scoring before undertaking 
Writing What You Read scoring. While order of rubric is certainly a variable 
that could impact judgments, we felt that our initial questions regarding the 
Writing What You Read rubric did not require systematic investigation of 
rubric order at this time. 

Each phase of scoring began with study and discussion of each rubric, the 
collaborative establishment of benchmark papers distributed along the scale 
points, and the scoring of at least three papers in a row where disagreement 
among raters on any scale was not greater than 0.5. Raters requested and 
were granted permission to locate ratings at midpoints in addition to defined 
scale points. Training papers for each major phase were drawn from all 
levels. When raters began the scoring of a given level, they conducted an 
additional training session; raters scored preselected papers independently, 
resolved disagreements through discussion, and placed these "benchmark" 
papers in the center of the table for reference. 

Because the set of papers for Grades 3 and 4 was by far the largest, raters 
rated half of these first, followed by Primary, Upper, and then the remaining 
Middle papers. Raters revisited the Middle-level benchmark papers when 
scoring the second half of that set. Raters rated material in bundles labeled 
with two raters' names; at any given time, each rater made a random choice of 
a bundle to score. The material was distributed so that two raters rated each 
piece independently; scores were entered rapidly, and a third rater rated any 
paper whose scores on any scale differed by more than one scale point. A 
check set of three to eight papers was included halfway through the scoring 
session; any disagreements were resolved through discussion that made 
certain that raters were not changing their criteria for scoring. 
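
The adjudication rule described above can be illustrated with a short sketch. The function name, the dictionary layout, and the sample scores are hypothetical and are not taken from the study; only the rule itself (a third rating whenever two independent scores on any scale differ by more than one scale point) comes from the text.

```python
def needs_third_rating(first, second, threshold=1.0):
    """Return True when any scale's two scores differ by more than `threshold`.

    `first` and `second` map scale names to scores; midpoint scores such as
    3.5 are allowed. All names here are illustrative assumptions.
    """
    return any(abs(first[scale] - second[scale]) > threshold for scale in first)


# Hypothetical scores from two raters for one narrative.
rater_a = {"Theme": 3.0, "Character": 3.5, "Plot": 2.0}
rater_b = {"Theme": 3.5, "Character": 2.5, "Plot": 3.5}

# Plot differs by 1.5 scale points, so this paper would go to a third rater.
flagged = needs_third_rating(rater_a, rater_b)
```

A difference of exactly one scale point does not trigger a third rating under this reading of the rule.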

Rater Reflection 

Raters were interviewed at two key points in the session — at the 
completion of the comparison scoring, and at the completion of the final 
Writing What You Read scoring. The comparison interview was conducted as 
a focus group; the final interview was a critique of the two rubrics and was 
conducted with two pairs of raters and one rater alone. Interviews were 
transcribed for analysis. The protocols for both interviews are contained in the 
Appendix. 

Results 

Rater Agreement 

Reliability: Can the Writing What You Read rubric be applied to scoring of 
classroom narratives with the same levels of rater agreement as comparison 
scoring? 

Rater agreement was examined using percent agreement, correlation 
coefficients, and generalizability coefficients. Because raters utilized midpoint 
ratings, percent agreement was computed for ± 0, 0.5, and 1.0. Analyses of 
agreement, correlation coefficients, and generalizability coefficients were 
based only on the material rated independently and thus excluded ratings 
negotiated during the training or the check sets. 

Correlation coefficients and percent agreement indices were computed for 
each pair of raters, and, for purposes of comparison, those estimates were 
averaged across all pairs of raters. The average percentages of agreement 
should be considered to be descriptive information rather than evidence of 
reliability, since given the small range of possible values and the restricted 
number of scale points, rather high levels of agreement may be expected just 
based on chance alone. Indeed, repeated estimation of agreement indices after 
random permutations of the data indicated that, for these scales and these 
data, the chance levels of agreement for uncorrelated ratings were on the 
order of .16, .44, and .67 for the ±0, ±0.5, and ±1 indices, respectively. The 
introduction of very moderate correlations between ratings is sufficient to 
cause the percentages of adjacent (±1) agreement to approach the ceiling value 
of 1.00. The average correlations can be interpreted much like classical 
reliability coefficients, with the difference that instead of estimating the 
correlation between parallel forms of a test (as in classical reliability theory), 
we are estimating the correlation between parallel ratings of a single test. 
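
The agreement indices and the permutation-based chance levels described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the function names, the number of permutations, and the sample ratings are all hypothetical.

```python
import random


def agreement_within(pairs, tolerance):
    """Proportion of paired ratings whose absolute difference is <= tolerance."""
    hits = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return hits / len(pairs)


def chance_agreement(pairs, tolerance, n_permutations=2000, seed=0):
    """Mean agreement after repeatedly shuffling the second rater's scores.

    Shuffling breaks any link between the two raters' judgments while
    keeping each rater's own score distribution, which approximates the
    agreement expected for uncorrelated ratings.
    """
    rng = random.Random(seed)
    first = [a for a, _ in pairs]
    second = [b for _, b in pairs]
    total = 0.0
    for _ in range(n_permutations):
        shuffled = second[:]
        rng.shuffle(shuffled)
        total += agreement_within(list(zip(first, shuffled)), tolerance)
    return total / n_permutations


# Hypothetical paired ratings (midpoint scores allowed), not study data.
pairs = [(3.0, 3.5), (2.0, 2.0), (4.5, 3.5), (1.0, 2.5), (5.0, 5.0), (3.5, 3.0)]

observed = {tol: agreement_within(pairs, tol) for tol in (0.0, 0.5, 1.0)}
chance = {tol: chance_agreement(pairs, tol) for tol in (0.0, 0.5, 1.0)}
```

Comparing `observed` with `chance` at each tolerance shows how much of the raw agreement could arise from the narrow score range alone.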

Interrater reliability for both rubrics was also assessed through the use of 
generalizability theory, a powerful and appropriate methodology for 
addressing issues of rater agreement. For purposes of discussion here, a 
generalizability coefficient can be considered to be analogous to the classical 
reliability coefficient. Both can be computed as ratios of variances. A 
reliability coefficient is the ratio of an examinee's true score variance to the 
observed score variance, and it is an estimate of the correlation between scores 
on parallel forms of a test. Generalizability coefficients are ratios of variance 
due to the objects of measurement (in our case, students' essay scores) to the 
total variance due to the objects of measurement and the conditions of 
measurement (in our case raters). They are estimates of the correlations 
between observations obtained under different conditions of measurement (by 
different raters). 

Generalizability theory is much more flexible than classical reliability 
theory in that generalizability coefficients can be tailored to suit the particular 
purposes of an evaluation. For example, separate generalizability coefficients 
can be computed for relative and absolute decisions. If one were interested 
mainly in accurately ranking a set of essays, then a relative coefficient would 
be of interest. If relative generalizability is high, then one can be confident that 
two different raters scoring the same set of essays would create consistent 
rank orderings of the essays. That is, there would be a high degree of 
agreement on which essay is the best, which is second best, etc. On the other 
hand, if one is making decisions about proficiency by comparing scores to an 
absolute standard, such as a cut score, or is comparing scores assigned by 
different raters, then an absolute coefficient is more appropriate. This type of 
coefficient takes into account the variance that is due to differences between 
raters. If, in the scenario presented above, relative generalizability were high 
but absolute generalizability were low, then it would be difficult to have 
confidence in comparisons between means of sets of essays rated by different 
raters. 
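
The distinction between the two coefficients can be written compactly. The symbols below follow a notation common in generalizability theory texts and are an assumption for illustration; the text itself gives no formulas. For essays p crossed with raters r, with each reported score averaged over n_r raters:

```latex
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e}/n_r},
\qquad
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \left(\sigma^2_r + \sigma^2_{pr,e}\right)/n_r}
```

Here \sigma^2_p is the variance due to essays, \sigma^2_r the variance due to raters, and \sigma^2_{pr,e} the rater-by-essay interaction confounded with error. The rater variance enters only the absolute coefficient \Phi, so \Phi can never exceed the relative coefficient E\rho^2.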

Another advantage of generalizability theory is that it is easy to extend the 
results of a generalizability study (G-study) to what is called a decision study 
(D-study). In classical test theory, the reliability of the test is a function of the 
length of the test; longer tests are more reliable, and the reliability of a test can 
be improved by adding more items. The analogous procedure in a rating 
situation is to improve reliability by adding more raters, multiply scoring each 
essay, and aggregating the results. The G-study coefficients in our study can 
be interpreted as reliability indices for scores based on a single rater. If those 
coefficients are too low, then a D-study can be done to examine the effects on 
generalizability of adding more raters. An informed decision can then be 
made as to how many raters should be used to attain adequate levels of 
generalizability. 
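
A D-study projection of this kind can be sketched in a few lines. The formulas are the standard single-facet ones (essay variance divided by essay variance plus error variance averaged over raters); the function name and the variance components are hypothetical values chosen for illustration, not estimates from this study.

```python
def d_study(var_essay, var_rater, var_residual, n_raters):
    """Project relative and absolute coefficients for scores averaged over n_raters.

    var_essay    : variance due to essays (the objects of measurement)
    var_rater    : variance due to raters (affects absolute decisions only)
    var_residual : rater-by-essay interaction confounded with error
    """
    relative = var_essay / (var_essay + var_residual / n_raters)
    absolute = var_essay / (var_essay + (var_rater + var_residual) / n_raters)
    return relative, absolute


# Hypothetical variance components, not values reported in the study.
components = {"var_essay": 0.6, "var_rater": 0.1, "var_residual": 0.3}

# Coefficients projected for one, two, and three raters per essay.
projection = {n: d_study(n_raters=n, **components) for n in (1, 2, 3)}
```

With these illustrative components, moving from one rater to two raises the relative coefficient from about .67 to .80, which is the kind of gain a D-study is meant to quantify.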

The design for the G-study for this paper utilized essays as the object of 
measurement, and raters as conditions of measurement. In the parlance of 
generalizability theory this is a single-facet model, and we are interested in the 
generalizability of scores across raters. Three variance components must be 
estimated: those due to essays, raters, and the rater-by-essay interaction. In 
the ideal situation, all papers would be read by all raters, and the estimation of 
these components would be rather routine. Since that was not the case in this 
study, it was necessary to take additional steps in order to obtain stable 
estimates. A more thorough treatment of generalizability theory in general, 
and the procedures used to estimate variance components for this study in 
particular, may be found in Novak & Abedi (in preparation). 
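
For the ideal, fully crossed case mentioned above (every essay read by every rater), the three variance components can be estimated from ordinary two-way ANOVA mean squares. The sketch below covers only that textbook case. It is not the procedure used in this study, whose design was not fully crossed, and the function name and data are hypothetical.

```python
def variance_components(scores):
    """Estimate essay, rater, and residual variance components.

    `scores[i][j]` is the rating of essay i by rater j; the design must be
    fully crossed with no missing cells. Negative estimates are set to 0.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    essay_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(scores[i][j] for i in range(n_p)) / n_p for j in range(n_r)]

    # Sums of squares for the two main effects and the remainder.
    ss_p = n_r * sum((m - grand) ** 2 for m in essay_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pr = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    return {
        "essay": max((ms_p - ms_pr) / n_r, 0.0),
        "rater": max((ms_r - ms_pr) / n_p, 0.0),
        "residual": ms_pr,
    }


# Hypothetical data: rater 2 is always one point harsher than rater 1,
# so raters rank the essays identically and the residual is zero.
scores = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
estimates = variance_components(scores)
```

In this toy example the raters disagree in level but not in rank order, so a relative coefficient would be perfect while an absolute coefficient would be lowered by the rater component.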

Percentages of agreement. Patterns of rater agreement differed between 
rubrics. While overall agreement for comparison ratings (Table 3) was 
generally satisfactory, it was lower and more variable across rater pairs than 
reliabilities achieved in previous studies (Gearhart, Herman, & Baker, 1992; 
Baker, Gearhart, & Herman, 1990, 1991). Rater agreement for the Writing 
What You Read ratings was generally acceptable, and somewhat higher and 
more consistent than that for comparison ratings. It was also, however, 
somewhat lower than the very high rates of agreement we have obtained for 
the comparison rubric in prior studies. There were no consistent differences 
among rater pairs in levels of agreement, nor any evident patterns among the 
subscales in levels of agreement. 

While the agreements reported in these tables were certainly satisfactory, 
they were not exemplary. The patterns of rater agreement obtained here may 
have been impacted by study purpose: Raters were informed from the outset 
that they were participating in a study of two narrative rubrics, and they were 
atypically slow, methodical, and analytic in their approach to scoring, raising 
and pursuing issues that are often handled quickly and dismissed in 
moderation sessions. We suspect that moderation discussions confronted the 
raters with the complexity and uncertainty of the rating process. 

Pearson correlations. The average correlations for the Overall, 
Character, and Communication scales for the WWYR rubric (Table 4) are 
quite comparable to those obtained for the three subscales for the Comparison 
rubric (Table 3), while those for the Theme, Setting, and Plot subscales were 
somewhat lower. The Plot scale was particularly problematic, with an 
average correlation of .48. Correlations across rater pairs were generally 
more consistent for the WWYR rubric, although this may be due largely to 
more stable estimates resulting from the larger number of papers that were 
scored using the WWYR rubric. Note that for the Comparison rubric the 
lowest correlations were obtained for the pairing of Raters 1 and 5 (.28 and 
.25 for the General Competence and the Focus/Organization scales, 
respectively); those estimates, however, were based on a sample of only twelve 
papers. 

Table 3 

Rater Agreement: Comparison Rubric 

                            General         Focus/          Development/
Index and Raters            Competence      Organization    Elaboration

Pearson correlation coefficients
  Raters 1 and 2 (N=18)     [illegible]*    [illegible]     .63**
  Raters 1 and 3 (N=21)     .56**           .57**           .39
  Raters 1 and 4 (N=18)     .82**           .78**           [illegible]**
  Raters 1 and 5 (N=12)     .28             .25             .70*
  Raters 2 and 3 (N=20)     .79**           [illegible]**   .73**
  Raters 2 and 4 (N=18)     .73**           .56*            .51*
  Raters 2 and 5 (N=16)     .84**           .59**           .69**
  Raters 3 and 4 (N=18)     .88**           .85**           .90**
  Raters 3 and 5 (N=15)     .61*            .53*            .46
  Raters 4 and 5 (N=19)     .73**           .62**           .61**
  Average                   .68             .60             .63

Percent agreement ±0
  Raters 1 and 2            .50             .50             .44
  Raters 1 and 3            .38             .38             .24
  Raters 1 and 4            .39             .39             .39
  Raters 1 and 5            .33             .08             .42
  Raters 2 and 3            .40             .15             .20
  Raters 2 and 4            .28             .28             .33
  Raters 2 and 5            .38             .13             .13
  Raters 3 and 4            .56             .44             .39
  Raters 3 and 5            .20             .20             .20
  Raters 4 and 5            .32             .26             .32
  Average                   .37             .28             .31

Percent agreement ±0.5
  Raters 1 and 2            .83             .72             .72
  Raters 1 and 3            .71             .57             .52
  Raters 1 and 4            .78             .78             .72
  Raters 1 and 5            .50             .50             .58
  Raters 2 and 3            .70             .75             .50
  Raters 2 and 4            .67             .61             .72
  Raters 2 and 5            .81             .69             .75
  Raters 3 and 4            .78             .67             .89
  Raters 3 and 5            .73             .40             .67
  Raters 4 and 5            .79             .74             .63
  Average                   .73             .64             .67

Percent agreement ±1.0
  Raters 1 and 2            .94             .94             .94
  Raters 1 and 3            .86             .81             .81
  Raters 1 and 4            1.00            1.00            1.00
  Raters 1 and 5            .67             .75             .83
  Raters 2 and 3            1.00            .95             1.00
  Raters 2 and 4            .89             .94             .89
  Raters 2 and 5            1.00            1.00            1.00
  Raters 3 and 4            1.00            .94             1.00
  Raters 3 and 5            .87             .87             .93
  Raters 4 and 5            1.00            1.00            1.00
  Average                   .92             .92             .94

*p<.05. **p<.01. 
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Table 4 

Rater Agreement: Writing What You Read Rubric

                          Overall
Level                     Effectiveness  Theme      Character  Setting    Plot       Communication

Pearson correlation coefficients
  Raters 1 and 2 (N=48)     [?]            .52**      .56**      .47**      .55**      .63**
  Raters 1 and 3 (N=48)     .75**          .64**      .80**      .47**      .71**      .75**
  Raters 1 and 4 (N=27)     .80**          .77**      .79**      .67        .71        .82**
  Raters 1 and 5 (N=37)     .75**          .61**      .69**      [?]        .57**      .50**
  Raters 2 and 3 (N=59)     .60**          .41**      .64**      .49**      .50**      .66**
  Raters 2 and 4 (N=53)     .70**          .60**      .77**      .59        .66        .77
  Raters 2 and 5 (N=[?])    [?]            [?]        [?]        [?]        [?]        [?]
  Raters 3 and 4 (N=42)     .52**          .64**      .44**      .48**      .45**      [?]
  Raters 3 and 5 (N=93)     .54**          .61**      .58**      [?]        [?]        [?]
  Raters 4 and 5 (N=44)     .64**          .56**      .71**      .53**      .58*       .64**
  Average                   .64            .59        .66        .48        .57        .66

Percent agreement ±0
  Raters 1 and 2            .44            .38        .40        .44        .40        .42
  Raters 1 and 3            .52            .46        .55        .50        .48        .42
  Raters 1 and 4            .56            .52        .56        .56        .48        .56
  Raters 1 and 5            .41            .46        .47        .51        .41        .32
  Raters 2 and 3            .41            .29        .31        .34        .32        .51
  Raters 2 and 4            [?]            [?]        [?]        [?]        [?]        [?]
  Raters 2 and 5            .52            .34        .47        .31        .26        .47
  Raters 3 and 4            .43            .38        .29        .31        .29        .38
  Raters 3 and 5            .42            .42        .36        .46        .43        .44
  Raters 4 and 5            .45            .45        .43        .55        .41        .39
  Average                   .46            .41        .43        .46        .39        .44

*p<.01. **p<.05.
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Table 4 (continued) 



                          Overall
Level                     Effectiveness  Theme      Character  Setting    Plot       Communication

Percent agreement ±0.5
  Raters 1 and 2            [?]            [?]        [?]        .79        [?]        [?]
  Raters 1 and 3            [?]            [?]        [?]        [?]        .77        [?]
  Raters 1 and 4            [?]            [?]        [?]        [?]        [?]        [?]
  Raters 1 and 5            [?]            [?]        [?]        [?]        [?]        .70
  Raters 2 and 3            [?]            [?]        [?]        [?]        [?]        .85
  Raters 2 and 4            [?]            .77        [?]        [?]        .79        .89
  Raters 2 and 5            .88            [?]        [?]        [?]        [?]        [?]
  Raters 3 and 4            .86            .79        .64        .71        .76        .81
  Raters 3 and 5            [?]            .73        .71        .72        .82        .84
  Raters 4 and 5            [?]            .68        .77        .77        .89        .82
  Average                   .85            .72        .72        .71        .76        .82

Percent agreement ±1.0
  Raters 1 and 2            [?]            [?]        .91        .85        .96        .96
  Raters 1 and 3            [?]            [?]        [?]        .94        .96        .96
  Raters 1 and 4           1.00           1.00       1.00       1.00       1.00       1.00
  Raters 1 and 5            [?]            [?]        .97        .97        .92        .97
  Raters 2 and 3            [?]            [?]        .93        .92        .93        .97
  Raters 2 and 4           1.00            .92        .98        .94        .96        .98
  Raters 2 and 5            .95            .97        .86        .90        .95        .98
  Raters 3 and 4            .93            .95        .90        .95        .95        .95
  Raters 3 and 5            .96            .95        .97        .94        .95        .98
  Raters 4 and 5            .95            .95        .98        .93        .95        .93
  Average                   .96            .95        .95        .93        .95        .97

*p<.01. **p<.05.
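
The agreement indices reported in Tables 3 and 4 can be illustrated with a minimal sketch. The function names and the two score vectors below are hypothetical and are not data from the study; the sketch only shows how a Pearson correlation and a percent agreement within a tolerance (±0, ±0.5, ±1.0) might be computed for one pair of raters who scored the same papers.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two raters' scores for the same papers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def percent_agreement(x, y, tolerance):
    """Proportion of papers on which two raters differ by at most `tolerance`."""
    # A tiny epsilon guards against floating-point noise at the boundary.
    hits = sum(1 for a, b in zip(x, y) if abs(a - b) <= tolerance + 1e-9)
    return hits / len(x)


# Hypothetical scores on a 6-point scale with half points (illustration only).
rater_a = [2.0, 3.0, 3.5, 4.0, 2.5, 5.0]
rater_b = [2.0, 3.5, 3.0, 5.0, 2.5, 3.5]

for tol in (0.0, 0.5, 1.0):
    print(f"agreement within ±{tol}: {percent_agreement(rater_a, rater_b, tol):.2f}")
print(f"Pearson r: {pearson_r(rater_a, rater_b):.2f}")
```

Widening the tolerance can only raise the agreement rate, which is why the ±1.0 panels in both tables approach 1.00 even where exact agreement is modest.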
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Generalizability coefficients. Table 5 shows the proportions of variance 
attributable to Essays, Raters, and to the Essay-by-Rater interaction, and the 
resultant generalizability coefficients. Coefficients for both relative and 
absolute decisions are reported. Note that for both rubrics the proportion of 
variance due to Raters is almost negligible. This indicates quite good 
consistency in the application of the scoring rubrics across raters, and has 
very positive implications with respect to the feasibility of using scores based on 
these rubrics to make absolute decisions about students' proficiencies, such as 
assignments to proficiency categories based on cutpoints, or comparisons of 
scores assigned to students by different raters. If the variance due to raters 
were large, then we would have very little assurance that scores assigned to 
students by different raters were based on the same scale. That is, if this were 
the case, then we could not be confident that a 3 given by one rater indicated the 
same level of proficiency as a 3 given by another rater. It is possible for raters 
to agree perfectly with respect to relative decisions and still not agree well with 
respect to absolute decisions. For example, if two raters scored a set of papers, 
and one rater always gave each paper a score that was 3 units higher than that 
awarded by the other rater, then the relative generalizability of those two raters 
would be perfect, while the absolute generalizability would be low. This is not 
the case here, however, and the very small variance components for raters 
ensure that the generalizability coefficients for relative and absolute decisions 
will be quite close together, as we see in Table 5. 
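
The constant-offset example above can be checked numerically. The following is a minimal sketch, assuming a fully crossed essay-by-rater design with one score per cell and the usual expected-mean-square estimators; the function `g_study` and the scores are hypothetical, and the study's own estimation procedure for its unbalanced rater pairings is not described here and may differ.

```python
def g_study(scores):
    """Estimate variance components for a crossed essay x rater design.

    scores[i][j] is the score given to essay i by rater j.
    Returns single-rater relative and absolute generalizability coefficients.
    """
    n_e, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_e * n_r)
    essay_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[j] for row in scores) / n_e for j in range(n_r)]

    # Mean squares from the two-way ANOVA without replication.
    ms_e = n_r * sum((m - grand) ** 2 for m in essay_means) / (n_e - 1)
    ms_r = n_e * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
    ms_res = sum(
        (scores[i][j] - essay_means[i] - rater_means[j] + grand) ** 2
        for i in range(n_e)
        for j in range(n_r)
    ) / ((n_e - 1) * (n_r - 1))

    # Variance components; negative estimates are truncated at zero.
    var_er = ms_res
    var_e = max((ms_e - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_e, 0.0)

    return {
        "var_e": var_e,
        "var_r": var_r,
        "var_er": var_er,
        "relative": var_e / (var_e + var_er),
        "absolute": var_e / (var_e + var_r + var_er),
    }


# Hypothetical scores: the second rater is always exactly 3 units higher.
first_rater = [1, 2, 3, 2, 1, 3]
result = g_study([[s, s + 3] for s in first_rater])
print(result["relative"], result["absolute"])
```

In this constructed case the raters rank the essays identically, so the relative coefficient is perfect, while the large Rater component pulls the absolute coefficient down sharply.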

Comparisons across rubrics. Comparing across rubrics and scales, we 
see that the G-coefficients for the Comparison rubric scales tend to be 
consistently higher than those for the WWYR rubric. G-coefficients for the 
Comparison rubric are quite consistent across scales, while there is 
considerable variation in the generalizability for the WWYR subscales, with 
the Setting subscale the most problematic with an estimated generalizability 
coefficient of 0.47. 

Table 5 

Generalizability Coefficients 

                                       Variance components      Generalizability coefficients
Rubric       Scale                       E       R       ER       Relative    Absolute

Comparison   General Competence         0.68    0.00    0.32      0.68        0.68
             Focus/Organization         0.63    0.01    0.36      0.64        0.63
             Development/Elaboration    0.66    0.01    0.34      0.66        0.65

WWYR         Overall                    0.60    0.01    0.40      0.60        0.59
             Theme                      0.55    0.04    0.41      0.57        0.55
             Character                  0.62    0.01    0.37      0.63        0.62
             Setting                    0.47    0.00    0.53      0.47        0.47
             Plot                       0.55    0.00    0.45      0.55        0.55
             Communication              0.62    0.00    0.37      0.63        0.63

Note. Standardized variance component estimates for Essay (E), Rater (R), and the Essay-by- 
Rater interaction (ER), and the generalizability coefficients derived from those estimates, for 
each of the Comparison and WWYR scales. 

D-study coefficients. If we compare the results in Table 5 with those in 
Tables 2 and 3, we see that the generalizability coefficients agree closely with 
the average Pearson correlations. The generalizability coefficients for relative 
decisions reported in Table 5 can be interpreted as reliability coefficients for 
scores based on a single rater, and those estimates are somewhat lower than 
would be desired. Although there are no cut-and-dried guidelines for what 
determines an adequate level of reliability, most researchers would probably 
like to see reliabilities of at least .75. The generalizability coefficients for both 
rubrics fall well below that threshold. The next step within the context of 
generalizability theory was to use the results of the G-study to perform a 
D-study in order to determine how to attain an acceptable reliability level. Table 6 
reports D-study generalizability coefficients for scores based on 1, 2, 3, and 5 
raters. 
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Table 6 

D-study Coefficients 

                                           Relative                    Absolute
Rubric       Scale                       1     2     3     5         1     2     3     5

Comparison   General Competence         0.68  0.81  0.86  0.91      0.68  0.81  0.86  0.91
             Focus/Organization         0.64  0.78  0.84  0.90      0.63  0.77  0.84  0.89
             Development/Elaboration    0.66  0.80  0.85  0.91      0.65  0.79  0.85  0.90

WWYR         Overall                    0.60  0.75  0.82  0.88      0.59  0.75  0.81  0.88
             Theme                      0.57  0.73  0.80  0.87      0.55  0.71  0.79  0.86
             Character                  0.63  0.77  0.83  0.89      0.62  0.77  0.83  0.89
             Setting                    0.47  0.64  0.73  0.82      0.47  0.64  0.73  0.82
             Plot                       0.55  0.71  0.79  0.86      0.55  0.71  0.79  0.86
             Communication              0.63  0.77  0.83  0.89      0.63  0.77  0.83  0.89

Note. D-study generalizability coefficients for relative and absolute decisions for essay 
scores based on 1, 2, 3, or 5 raters. 



The results of the D-study show that for all of the Comparison subscales 
and for three of the WWYR subscales, adequate reliability (as defined above) 
can be obtained through the use of two raters. Note, however, that for the 
WWYR Setting subscale, even the use of three raters is not sufficient to ensure 
a reliability level of .75. Using four raters would result in a coefficient of .78 for 
this scale. Again, due to the very small proportions of variance attributable to 
the Rater main effects, results and interpretations for relative and absolute 
decisions are nearly identical. 
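
The D-study projections can be reproduced from the variance components in Table 5 with the standard formulas for a design in which each essay's score is the mean of n raters: the Essay-by-Rater component (and, for absolute decisions, the Rater component as well) is divided by n. The sketch below is illustrative; the function names are hypothetical, and the .75 target is the threshold named in the text. Coefficients are rounded to two decimals before comparison, mirroring the precision of Table 6.

```python
def d_study(var_e, var_r, var_er, n_raters):
    """Projected relative and absolute coefficients for scores averaged over n raters."""
    relative = var_e / (var_e + var_er / n_raters)
    absolute = var_e / (var_e + (var_r + var_er) / n_raters)
    return relative, absolute


def raters_needed(var_e, var_r, var_er, target=0.75, max_raters=10):
    """Smallest number of raters whose rounded relative coefficient reaches the target."""
    for n in range(1, max_raters + 1):
        relative, _ = d_study(var_e, var_r, var_er, n)
        if round(relative, 2) >= target:
            return n
    return None  # target not reached within max_raters


# Variance components taken from Table 5 (E, R, ER).
setting = (0.47, 0.00, 0.53)
general_competence = (0.68, 0.00, 0.32)

for n in (1, 2, 3, 4, 5):
    rel, ab = d_study(*setting, n)
    print(f"Setting, {n} rater(s): relative={rel:.2f}, absolute={ab:.2f}")

print("raters needed, Setting:", raters_needed(*setting))
print("raters needed, General Competence:", raters_needed(*general_competence))
```

For the Setting subscale this yields the .78 quoted above for four raters, whereas General Competence already clears the threshold with two.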
Validity 

Validity of the Writing What You Read rubric: What is the evidence that scores 
derived from the WWYR rubric are meaningful indices of students' narrative 
writing? 

This section contains four analyses of the Writing What You Read 
rubric's capacity to produce meaningful results: (a) comparisons of students' 
scores across grade levels (scores should increase with grade level); 
(b) intercorrelations of subscales within rubrics (for each rubric, subscales 
should not be highly correlated); (c) correlations of ratings across rubrics 
(WWYR scores should correlate significantly with comparison scores); (d) an 
analysis of decision consistency across rubrics (raters should make similar 
decisions about students' competence across rubrics). All ratings contributed 
to these results: Paper scores were computed as the average of the 
independent ratings or the resolved score achieved through discussion during 
the training and check sets. 

Grade level comparisons. Tables 7 and 8 contain descriptive statistics for 
each rubric and, for each subscale, the results of ANOVAs by Level. For each 
rubric, there were score differences in the expected direction by grade level. 
The pattern of score differences was the same for all scales and both rubrics, 
although the ANOVA result for one WWYR subscale (Plot) was not 
significant. 
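
As a rough check on the grade-level comparisons, a one-way ANOVA F ratio can be recomputed from group sizes, means, and standard deviations alone. The sketch below uses the General Competence summary statistics from Table 7; the function name is hypothetical, and the Upper-level group size of 17 is an assumption inferred from the reported degrees of freedom, F(2,66), together with the Primary (16) and Middle (36) group sizes. Because the published means and SDs are rounded, the result should only approximate the reported F of 36.380.

```python
def anova_from_summary(groups):
    """One-way ANOVA F ratio from (n, mean, sd) summaries of each group."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n

    # Between-groups and within-groups sums of squares.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)

    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# General Competence, Table 7: (N, mean, SD) for Primary, Middle, Upper.
# N = 17 for Upper is inferred from the reported df, not read from the table.
general_competence = [(16, 2.05, 0.47), (36, 2.58, 0.55), (17, 3.54, 0.49)]

f_ratio, df_b, df_w = anova_from_summary(general_competence)
print(f"F({df_b},{df_w}) = {f_ratio:.2f}")
```

Agreement between the recomputed and reported values to within rounding error supports the inferred group size.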

Intercorrelations of subscales within rubrics. Tables 9 and 10 contain 
intercorrelations of subscales for each rubric. All subscales were highly 
correlated, indicating that raters were not making highly differentiated 
judgments about a narrative's competence along each dimension. Based on 
these results, subscales for both rubrics are not empirically distinct. 

Correlations of ratings across rubrics. Table 11 contains correlations 
between the subscales of the two rubrics. Across rubrics, scores were highly 
intercorrelated, although the correlations were lower in magnitude than the 
within-rubric correlations (Tables 9 and 10). 
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Table 7 

Descriptives, Comparison Rubric 

                                       Subscale
                    General         Focus/          Development/
Level               Competence      Organization    Elaboration

Primary (N=16)
  Mean                2.05            2.29            2.27
  SD                   .47             .48             .45

Middle (N=36)
  Mean                2.58            2.68            2.79
  SD                   .55             .50             .59

Upper (N=17)
  Mean                3.54            3.66            3.67
  SD                   .49             .67             .57

Note. For this analysis, N = number of subjects. ANOVAs examining 
differences among Levels for each scale: General Competence, F(2,66) = 
36.380, p < .0001; Focus/Organization, F(2,66) = 29.136, p < .0001; 
Development/Elaboration, F(2,66) = 26.978, p < .0001. 



Table 8 

Descriptives, Writing What You Read Rubric 

                                              Subscale
Level               Overall   Theme     Character  Setting   Plot      Communication

Primary (N=17)
  Mean                2.29      2.47      2.15       2.27      2.44      2.33
  SD                   .39       .48       .53        .42       .49       .44

Middle (N=36)
  Mean                2.50      2.61      2.40       2.49      2.55      2.51
  SD                   .44       .45       .53        .43       .47       .49

Upper (N=20)
  Mean                2.87      3.02      2.78       2.73      2.80      2.96
  SD                   .59       .64       .74        .51       .64       .64

Note. For this analysis, N = number of subjects. ANOVAs examining differences among 
Levels for each scale: Overall, F(2,70) = 7.113, p < .002; Theme, F(2,70) = 6.105, p < .004; 
Character, F(2,70) = 5.445, p < .006; Setting, F(2,70) = 4.929, p < .01; Plot, F(2,70) = 2.473, p < 
.092; Communication, F(2,70) = 7.519, p < .001. 
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Table 9 

Subscale Correlations, Comparison Rubric (N=184) 

                                        Subscale
                              Focus/          Development/
Level and Subscale            Organization    Elaboration

Primary (N=36)
  General Competence            .80*            .81*
  Focus/Organization                            .74*

Middle (N=115)
  General Competence            .87*            .90*
  Focus/Organization                            .80*

Upper (N=35)
  General Competence            .9[?]           .86*
  Focus/Organization                            .82*

Overall (N=184)
  General Competence            .91*            .92*
  Focus/Organization                            .85*

*p < .001. 
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Table 10 

Subscale Correlations, Writing What You Read Rubric (N=187) 

                                        Scale
Scale             Theme     Character  Setting   Plot      Communication

Primary (N=37)
  Overall           .88*      .86*       .86*      .86*      .86*
  Theme                       .85*       .73*      .87*      .83*
  Character                              .77*      .82*      .81*
  Setting                                          .77*      .82*
  Plot                                                       .82*

Middle (N=112)
  Overall           .92*      .91*       .87*      .93*      .94*
  Theme                       .88*       .81*      .89*      .89*
  Character                              .85*      .88*      .88*
  Setting                                          .82*      .81*
  Plot                                                       .92*

Upper (N=38)
  Overall           .94*      .90*       .92*      .95*      .97*
  Theme                       .90*       .91*      .93*      .95*
  Character                              .83*      .91*      .91*
  Setting                                          .89*      .92*
  Plot                                                       .94*

Total (N=187)
  Overall           .93*      .91*       .89*      .93*      .94*
  Theme                       .90*       .84*      .90*      .90*
  Character                              .84*      .89*      .89*
  Setting                                          .84*      .85*
  Plot                                                       .91*

*p < .001. 






Table 11 

Correlations Across Rubrics 



Writing What You Read Scale 



Comparison 
Scale 



Overall Theme Character Setting Plot 



Primary (N=36) 

General 
Competence 

Focus/ 

Organization 

Development/ 
Elaboration 

Middle (N=107) 

General 
Competence 

Focus/ 

Organization 

Development/ 
Elaboration 

Upper (N=33) 

General 
Competence 

Focus/ 

Organization 

Development/ 
Elaboration 

Total (N=176) 

General 
Competence 

Focus/ 

Organization 

Development/ 
Elaboration 



.62*** 

.46* 

.62*** 

.79*** 

74*** 

74*** 

.65*** 
.67*** 

.75** 
.67** 
.72** 



Commun- 
ication 



.61*** 

.54** 

.60*** 

.75*** 
.68*** 
.71*** 

.55*** 
.62*** 

.73** 
.66** 
.70** 



.70" 



.44" 



.65" 



.70" 



.72" 



.59" 



.71* 



.74" 



.59* 



.43* 



.60" 



.65* 



.68" 



.56" 



.60" 



.66" 



.64** .58" 



.69" 



.56" 



.75*** .71*** .72" 



.71" 



.64* 



.66" 



.70" 



.74" 



.65" 



.65" 



.67" 



.62" 



.66" 



.68" 



.58" 



.61*** .65*** .65*** .64" 



.77" 



.68" 



.74" 



.73* 



54*** 



.68" 



.74" 



.68" 



.73* 



*p<.05. **p<.01. ***p<.001. 
Decision consistency across rubrics. To examine consistency in raters' 
judgments of narrative competence across rubrics, we cross-classified scores 
for General Competence (comparison) and Overall Effectiveness (WWYR) 
(Table 12). These results must be interpreted in the context of two important 
issues. First, although both rubrics are 6-point scales, their scale points do not 
correspond in meaning; in particular, the WWYR rubric is developmental and 
is not intended to locate competency at any particular level. Second, although 
the "best fit" for WWYR's definition of a competent narrative may be Level 3 
("One episode narrative (either brief or more extended) which includes the four 
critical elements of problem, emotional response, action, and outcome. . . . "), 
the criteria for this level were considered unclear by our raters, as we discuss 
below. 

We chose a WWYR mean rating of 3 or above as evidence of competence, 
and compared WWYR judgments against comparison ratings of 3.5 or above, 
consistent with the comparison rubric's distinction between a "developing 
writer" (Level 3) and a "competent writer" (Level 4). Most papers were judged 
as lacking in competence. Raters agreed in their classifications of 146 of 176 
papers (Pearson, p < .00001). However, there was no consistent agreement in 
classification of "competent" papers: Of the 55 papers judged as competent 
with either rubric, only 25 were classified as competent with both rubrics. 



Table 12 

Cross-Classification of Comparison and WWYR Scores (N=176) 

                            WWYR Overall Effectiveness
Comparison
General Competence          < 3.0          = or > 3.0

  < 2.5                      121              14
  = or > 2.5                  16              25

Note. For each rubric, each paper was scored by at 
least two raters; paper scores were computed as the 
mean of all raters' judgments. 
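
The counts in Table 12 are enough to recompute the agreement figures cited in the text. The sketch below is illustrative only: it assumes that "Pearson" refers to a Pearson chi-square test of independence on the 2 x 2 table (without continuity correction), which the text does not state explicitly, and the variable names are hypothetical.

```python
from math import erfc, sqrt

# Table 12 counts: rows = Comparison General Competence (below, at/above cutoff),
# columns = WWYR Overall Effectiveness (below, at/above cutoff).
table = [[121, 14], [16, 25]]

n = sum(sum(row) for row in table)
same_decision = table[0][0] + table[1][1]          # both "not competent" or both "competent"
competent_either = n - table[0][0]                 # competent on at least one rubric
competent_both = table[1][1]

# Pearson chi-square statistic for independence (assumed test; no correction).
row_totals = [sum(row) for row in table]
col_totals = [table[0][j] + table[1][j] for j in range(2)]
chi_square = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)
# For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
p_value = erfc(sqrt(chi_square / 2))

print(f"agreement: {same_decision} of {n} ({same_decision / n:.2f})")
print(f"competent on either rubric: {competent_either}; on both: {competent_both}")
print(f"chi-square = {chi_square:.2f}, p = {p_value:.2e}")
```

The high overall agreement is driven largely by the 121 papers judged below the cutoff on both rubrics; among papers judged competent on at least one rubric, fewer than half were judged competent on both.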
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Raters' Reflections 

What are raters' views of the utility and validity of the comparison and Writing 
What You Read rubrics? 

Raters were interviewed at two points in the rating process— following 
comparison ratings (a focus group discussion) and following completion of 
ratings with both rubrics (an interview with pairs of raters). At each 
interview, raters scored sample narratives and discussed the fit of the rubrics 
to the papers. The results reported below highlight the raters' comparisons of 
the WWYR to the comparison rubric. Raters raised concerns regarding rubric 
content, ease of use, instructional potential, and feasibility for large-scale 
scoring. 

Rubric content. Raters offered a balanced appraisal of the strengths of 
each rubric. Raters viewed WWYR as more comprehensive in its analysis of 
narrative, more "positive" in each of its scale-point definitions (more specific 
about narrative qualities and less "negative" or comparative), and more 
complete in its analysis of a narrative's "development." The content "missing" 
in the comparison Development/Elaboration subscale was first discussed even 
prior to the raters' introduction to WWYR, when raters explained that they 
had added content that they considered central to their judgment of narrative: 
"I put feeling under Elaboration. I know it's not, but . . . you need to." "There's 
a big difference between actually seeing something visually and feeling 
something . . . [S]omething can be 'vivid,' and something can be 'elaborate,' 
but it might not make you feel emotionally." 

In their critique of WWYR content, raters focused on Plot, Overall 
Effectiveness, Communication, and the absence of a scale like comparison's 
Focus/Organization. Plot and Overall Effectiveness were seen as weak in their 
middle sections, handling ineffectively those narratives that contained a series 
of incomplete episodes. Communication was considered helpful in 
pinpointing particular techniques, but its emphasis on language choices 
"appropriate to the narrative" made it difficult for the raters to give a child 
credit for stylistic strength that did not necessarily contribute to the narrative. 
In addition, they felt that Communication could be differentiated — at least for 
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instructional applications — as separate subscales for style, tone, and voice. ^ 
Finally, raters missed using comparison's Focus/Organization scale. While 
this comparison scale was seen as rather dry and perhaps exposition-like, it 
captured for these raters a dimension of organizational competence missing in 
WWYR. 

Raters felt that neither rubric was able to capture a narrative's local 
strengths: "Maybe they have one character description, or a setting, or 
something funny, and you laugh, but it really doesn't allow itself to be 4 and 
you want to tell them, 'Hey, you made me laugh here, or look at all these 
similes you were using.' " Similarly, some raters felt that neither rubric 
represented creativity very well: "There might be some idiosyncratic quality or 
some uniqueness about it, some originality that you can't really score." 
Wanting to "give credit" to a child for a moment of insight, humor, language 
use, or cleverness, they suggested providing a place on the rating form for 
personal comments to each writer on strengths and weaknesses. 

Ease of use. Although most raters felt that application of the WWYR 
rubric was a slower, more "analytical" process than comparison rating, only 
one of the five raters remained uncomfortable: "[The WWYR rubric is] so 
broken apart, analytic, that it confuses me." Indeed, the WWYR rubric did 
contain a greater number of scales and detail at each scale point, and, for this 
rater, the constructs required explication ("explicit and implicit, didactic and 
revealing . . . it's too much to keep track of"). For the remaining raters, the 
acknowledged difficulty of WWYR scoring was balanced with enjoyment. 

It was much easier, much more enjoyable to use the WWYR to score it. Because [the 
rubric] talked about the different subtleties of language and the different styles and 
emotions that you could use to make it more sophisticated and improve it. Whereas 
the comparison didn't really give that feeling . . . [L]anguage . . . just seemed like 
a skill rather than a quality of the work. 

Raters also appreciated the specificity of the WWYR rubric. Four of the 
five raters reported difficulty anchoring comparison judgments based on 
comparative criteria: "This 'few, many, little, and more' kind of vocabulary 
. . . was really a problem in the beginning . . . What is 'many?' What is 'few?' 
We had to make our own kind of interpretations, and then compare as we went 
on reading." Wishing for more positive and specific descriptions, one rater 
commented: "What is the paper doing, even though there might be 
inappropriate [language]. . . . 'No development of narrative elements': what 
can you say instead of that?" To adapt, raters reported several strategies for 
resolving uncertainty: expanding the list of comparison criteria (the addition 
of "emotion" to Development, as discussed above); making iterative 
comparisons with higher and lower scale points; using the anchor terms in 
the left column; making an initial dichotomous judgment between 
"Developing" (1-3) and "Competent" (4-6) writer and then refining the decision. 
WWYR, in contrast, supported greater focus on the fit of a narrative to the 
characteristics listed at a given level. 

^ An early version of the WWYR rubric in fact contained those dimensions 
(style, tone, and voice). Copies of the rubric draft are available from the authors. 

The raters' response to WWYR was encouraging. Their relative comfort 
indicated that a two- to three-hour WWYR training session can be adequate for 
many raters, if they are experienced with scoring and knowledgeable about 
narrative. Raters did offer suggestions for improvements of the WWYR rubric 
that would have facilitated scoring for them: highlighting key terms, listing 
criteria as bullets, and adopting overarching descriptors like those in 
comparison's left column (e.g., Developing Writer, Competent Writer). 

Instructional potential. Most raters viewed the WWYR rubric as having 
far more instructional potential than the comparison rubric, and those four 
raters who were classroom teachers planned to utilize it in some form in their 
classrooms. For example, one of the comparison Valley raters commented: 

[WWYR] allows you to compliment other strengths, and their styles . . . It's 
wonderful to have it for a teacher resource to direct the children, and the parents . . . 
When I'm scoring kids [with comparison], I'm having a hard time putting into 
words what I want them to do. With WWYR, I could get up and directly teach a 
lesson. 

But one of the four teachers felt that WWYR demanded more analysis 
than she could routinely or profitably undertake in the classroom. For this 
rater, difficulty of use limited instructional potential: "For many teachers, you 
have to give them something that's easy to apply, an easy tool that we can use. 
. . . Not too much analyzing, not too much re-reading. Something automatic, 
I would like a tool like that ... for our daily writing." A rubric with content as 
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complex as WWYR would be useful, she granted, when undertaking "a major 
project, then I want to use something like the Writing What You Read, if I 
want to touch on every single part [of the writing]." 

Feasibility of use for large-scale assessment. Raters agreed that the 
comparison rubric had the capacity to be used reliably and with reasonable 
speed. In contrast, the feasibility and utility of WWYR for large-scale 
assessment were left as unanswered questions. The WWYR Overall 
Effectiveness scale was considered as a possible holistic replacement for 
comparison's General Competence, but there were concerns about the relation 
between the two judgments: Overall Effectiveness required a rater to judge the 
narrative's integration of other narrative elements, still a fairly analytic task 
that felt different in content and in process from a General Competence 
decision. Although raters acknowledged that they themselves had acquired 
expertise with WWYR in half a day, they nevertheless expressed concern about 
the staff development that would be required to implement a large-scale 
program based on WWYR assessment. 

Summary and Discussion 

The purpose of this study was to gather evidence of validity for a new 
narrative rubric designed to enhance the instructional value of writing 
assessments, but whose technical quality is unknown. The design of the 
Writing What You Read narrative rubric was prompted by the need for 
assessment tools that can guide instruction. The rubric differs from most 
narrative rubrics in its narrative-specific content and its developmental 
framework. Designed for classroom use and shown to impact elementary 
teachers' understandings of narrative (Gearhart et al., 1994), the rubric 
contains five analytic subscales for Theme, Character, Setting, Plot, and 
Communication, and a sixth, holistic scale for Overall Effectiveness. Each 
subscale contains six levels designed to match current understandings of 
children's narrative development. It has never been used to date for large- 
scale assessment. 

Our study evaluated the Writing What You Read rubric against an 
established rubric that has consistently demonstrated sound technical 
capabilities in large-scale use. Our findings regarding the reliability and 
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validity of both rubrics yielded promising but mixed evidence of the utility of the 
Writing What You Read rubric for large-scale assessment. 

In general, both rubrics were used consistently by raters when making 
judgments of elementary children's classroom narratives. Rater agreements 
for three of the Writing What You Read subscales (Overall Effectiveness, 
Character, Communication) were consistent with those obtained with the 
comparison rubric, while levels of agreement for the other three WWYR 
subscales (Theme, Setting, Plot) were somewhat lower. Although overall 
agreement for both sets of ratings was generally satisfactory, it was lower and 
more variable across rater pairs than reliabilities achieved for previous 
studies; the WWYR rubric did not exhibit as much variation across rater pairs 
as did the comparison rubric. Results of the generalizability analyses 
indicated that adequate levels of reliability for most scales of either rubric could 
be attained by doubly scoring each essay and aggregating the results. 
However, for WWYR, achieving adequate reliability for Setting (and, to a lesser 
degree, Theme and Plot) could require as many as four raters. 

The patterns of rater agreement obtained here may have been impacted by 
both study purpose and rubric content. First, raters were informed from the 
outset that they were participating in a study of two narrative rubrics, and they 
were atypically slow, methodical, and analytic in their approach to scoring, 
raising and pursuing issues that are often handled quickly and dismissed in 
moderation sessions. Second, the WWYR rubric's representations of certain 
aspects of narrative competence were issues from the beginning of WWYR 
scoring. Although findings for raters' comments and the quantitative 
analyses were consistent only for Plot (in that both data sources pointed to 
content weaknesses), the overall findings do indicate a need to revisit aspects of 
the content of the rubric. 

There were several sources of evidence for the validity of the Writing What 
You Read rubric. First, the scores from both rubrics produced a pattern of 
increasing competence with grade level. Second, WWYR scores were highly 
correlated with the comparison scores, although there was some evidence for 
the distinctiveness of the two scales in the finding that cross-rubric subscale 
correlations were lower than within-rubric subscale correlations. Third, 
comparisons of raters' judgments made with both rubrics for the same 
narratives indicated some consistency in their decisions, although 
disagreements in classifications of "competent" narratives suggested 
distinctive definitions for competence. Finally, raters felt that the content of 
the WWYR rubric captured more aspects of narrative than the comparison 
rubric and had greater instructional potential. However, raters perceived 
some distinctive utility in the comparison Focus/Organization scale, and they 
recommended revisions of the scales for Plot, Overall Effectiveness, and 
Communication. They also expressed some concern about the professional 
development that would be required for WWYR scoring, despite their 
recognition that they had achieved understandings of the WWYR rubric and 
consensus in its use after only a two-hour training session. 

Thus our study has produced evidence that at least three scales of the 
Writing What You Read narrative rubric — an analytic writing rubric designed 
to enhance teachers' understandings of narrative and to inform instruction — 
can be used reliably and meaningfully in large-scale assessment of elementary 
level writing, provided that each narrative is rated by two raters. While we 
would have preferred that our analyses yield evidence of the technical 
soundness of all six scales, it is nevertheless heartening that a rubric as 
substantive as WWYR could produce findings this positive in an initial study. 

An important issue remains unresolved. Consistent with other studies of 
analytic scales, neither the WWYR nor the comparison rubric produced 
patterns of highly distinctive subscale judgments. We produced no empirical 
evidence for the subscales of either rubric. While raters agreed that WWYR 
scales had greater instructional utility than comparison scales and that each 
of the WWYR scales had relevance for instructional planning and classroom 
assessment, our quantitative findings suggest that subscale judgments may 
not provide a technically sound profile of students' strengths and weaknesses. 

We do not view these findings as a basis for rejecting an analytic 
framework for scoring, although the results may have implications for the 
value of subscale scores. Further research is needed to determine the factors 
that support or constrain distinctive subscale judgments — the structure and 
content of analytic rubrics, the types of material to be rated, and the methods of 
rater training. If technical studies continue to demonstrate that subscale 
judgments cannot be distinguished from overall competence ratings, we 
would argue for some "analytic" alternative to holistic scoring. One option 
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might be assignment of a single score, supplemented with rater commentary 
on strengths and weaknesses guided by checklists or open-ended prompts. 

Writing rubrics represent frameworks for interpretation of text and have 
potential to enhance teachers' knowledge and practice. When rubrics are 
designed to capture qualities of distinctive writing genres, then they have 
greater potential to support teachers' professional development, opportunities 
to learn in the classroom, and substantive interactions in moderation sessions. 
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